Preparations

Load the necessary libraries

library(car)       #for regression diagnostics
library(broom)     #for tidy output
library(broom.mixed)
library(ggfortify) #for model diagnostics
library(sjPlot)    #for outputs
library(knitr)     #for kable
library(effects)   #for partial effects plots
library(emmeans)   #for estimating marginal means
library(ggeffects)  #for partial effects plots
library(MASS)      #for glm.nb
library(MuMIn)     #for AICc
library(nlme)
library(lme4)      #for lmer
library(lmerTest)  #for satterthwaite p-values with lmer
library(performance) #for residuals diagnostics
library(see)         #for plotting residuals
#library(pbkrtest)  #for kenward-roger p-values with lmer
library(glmmTMB)   #for glmmTMB
library(DHARMa)   #for residuals and diagnostics
library(tidyverse) #for data wrangling

Scenario

A plant pathologist wanted to examine the effects of two different strengths of tobacco virus on the number of lesions on tobacco leaves. She knew from pilot studies that leaves were inherently very variable in response to the virus. In an attempt to account for this leaf to leaf variability, both treatments were applied to each leaf. Eight individual leaves were divided in half, with half of each leaf inoculated with weak strength virus and the other half inoculated with strong virus. So the leaves were blocks and each treatment was represented once in each block. A completely randomised design would have had 16 leaves, with 8 whole leaves randomly allocated to each treatment.

Tobacco plant

Sampling design

Format of tobacco.csv data files

LEAF TREAT NUMBER
1 Strong 35.898
1 Week 25.02
2 Strong 34.118
2 Week 23.167
3 Strong 35.702
3 Week 24.122
... ... ...
LEAF The blocking factor - Factor B
TREAT Categorical representation of the strength of the tobacco virus - main factor of interest Factor A
NUMBER Number of lesions on that part of the tobacco leaf - response variable

Read in the data

tobacco = read_csv('../public/data/tobacco.csv', trim_ws=TRUE)
glimpse(tobacco)
## Rows: 16
## Columns: 3
## $ LEAF      <chr> "L1", "L1", "L2", "L2", "L3", "L3", "L4", "L4", "L5", "L5",…
## $ TREATMENT <chr> "Strong", "Weak", "Strong", "Weak", "Strong", "Weak", "Stro…
## $ NUMBER    <dbl> 35.89776, 25.01984, 34.11786, 23.16740, 35.70215, 24.12191,…

Exploratory data analysis

Model formula: \[ y_i \sim{} \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i =\boldsymbol{\beta} \bf{X_i} + \boldsymbol{\gamma} \bf{Z_i} \]

where \(\boldsymbol{\beta}\) and \(\boldsymbol{\gamma}\) are vectors of the fixed and random effects parameters respectively and \(\bf{X}\) is the model matrix representing the overall intercept and effects of the treatment on the number of lesions. \(\bf{Z}\) represents a cell means model matrix for the random intercepts associated with leaves.

We will start by explicitly declaring the categorical variable (TREATMENT) as a factor. In addition, random effects (in this case LEAF) should also be declared as factors.

tobacco = tobacco %>%
  mutate(LEAF=factor(LEAF),
         TREATMENT=factor(TREATMENT))

To explore the assumptions of homogeneity of variance and normality, a boxplot of each Treatment level is appropriate.

ggplot(tobacco,  aes(y=NUMBER,  x=TREATMENT)) +
  geom_boxplot()

Conclusions:

  • both normality and homogeneity of variance seem satisfied

It can also be useful to get a sense of the consistency across blocks (LEAF). That is, do all Leaves have a similar baseline level of lesion susceptibility and do they respond similarly to the treatment.

ggplot(tobacco,  aes(y=NUMBER,  x=as.numeric(LEAF))) +
  geom_line(aes(linetype=TREATMENT))

## If we want to retain the original LEAF labels
ggplot(tobacco,  aes(y=NUMBER,  x=as.numeric(LEAF))) +
  geom_blank(aes(x=LEAF)) +
  geom_line(aes(linetype=TREATMENT))

Conclusions:

  • it is clear that some leaves are more susceptible to lesions (e.g. Leaf 7) than other leaves (e.g. Leaf 4)
  • most leaves (other than Leaf 4 and 6) have a similar response to the Treatments - that is most have higher number of lesions from the Strong Treatment than the Weak Treatment.

Given that there are only two levels of Treatment (Strong and Weak), it might be easier to visualise the differences in baselines and effect consistency by plotting as:

ggplot(tobacco,  aes(y=NUMBER,  x=TREATMENT,  group=LEAF)) +
  geom_point() +
  geom_line(aes(x=as.numeric(TREATMENT))) 

Conclusions:

  • this figure reiterates the points made earlier about the varying baselines and effect consistency.

The above figure also serves as a good way to visualise certain aspects of mixed effects models. When we fit a mixed effects model that includes a random blocking effect (in this case LEAF), we are indicating that we are allowing there to be a different intercept for each block (LEAF). In the current case, the intercept will represent the first Treatment level (Strong). So the random effect is specifying that the intercept can vary from Leaf to Leaf.

We can think of the model as having two tiers (a hierarchy), where the tiers of the hierarchy represent progressively smaller (typically) spatial scales. In the current example, the largest spatial units are the leaves (blocking factor). Within the leaves, there are the two Treatments (Strong and Weak) and within the Treatments are the individual observations.

We tend to represent this hierarchy upside down in the model formula:

\[ y_i \sim{} \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i =\beta_0 + \boldsymbol{\beta} \bf{X_i}\\ \beta_0 = \boldsymbol{\gamma} \bf{Z_i} \]

In addition to allowing there to be a different intercept per leaf, we could allow there to be a different magnitude of effect (difference between Strong and Week Treatment) per leaf. That is considered a random slope. From the figure above, there is some evidence that the effects (slopes) may vary from Leaf to Leaf.

Incorporating a random slope (in addition to a random intercept), may reduce the amount of unexplained variance and thus improve the power of the main effect (Treatment effect).

Fit the model

Model validation

Partial plots

Model investigation / hypothesis testing

Predictions

Summary figures

References